Congrats! You’ve made it to the end of the R workshop! We have prepared this documentation for you, where you will find all the code we wrote during the workshop. Don’t hesitate to go through it again at your own pace to make sure you understand everything ;-)
Have fun with R!
Celes, Lander, Lisa, Mustafa and Nikolas
## [1] "dplyr"
## [1] "tidyr"
## [1] "xlsx"
##
## sum> ## Pass a vector to sum, and it will add the elements together.
## sum> sum(1:5)
## [1] 15
##
## sum> ## Pass several numbers to sum, and it also adds the elements.
## sum> sum(1, 2, 3, 4, 5)
## [1] 15
##
## sum> ## In fact, you can pass vectors into several arguments, and everything gets added.
## sum> sum(1:2, 3:5)
## [1] 15
##
## sum> ## If there are missing values, the sum is unknown, i.e., also missing, ....
## sum> sum(1:5, NA)
## [1] NA
##
## sum> ## ... unless we exclude missing values explicitly:
## sum> sum(1:5, NA, na.rm = TRUE)
## [1] 15
## [1] "Hello"
## [1] "say \"Hello\""
## [1] 8
## [1] 7
## [1] 24
## [1] 5
## [1] 8
## [1] 8
## [1] 1
## [1] 15
## [1] 15
## [1] TRUE
## [1] TRUE
## [1] TRUE
R has five basic data types:
* character
* numeric
* integer
* complex
* logical
A character is a text (or string) value typed between single ('') or double ("") quotes.
#To assign a value to a variable, you need to use the command: variable <- value
my_character <- "hello there"
#To print out the data type of a variable, you can use the function typeof() or class()
typeof(my_character)
## [1] "character"
A numeric is a decimal value (also called double-precision real value).
## [1] "double"
Integers are a subset of the numeric data type. Integers are whole numbers and are specified by appending the letter "L".
## [1] "double"
## [1] "integer"
A complex number is a value with real and imaginary parts.
## [1] "complex"
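As a minimal sketch (the variable names and values are illustrative, not from the workshop), the five types could be created and inspected like this:

```r
# Illustrative examples of the five basic data types
my_character <- "hello there"
my_numeric   <- 3.14          # a decimal (double-precision) value
my_integer   <- 7L            # the "L" suffix makes it an integer
my_complex   <- 2 + 3i        # real and imaginary parts
my_logical   <- TRUE

typeof(my_numeric)   # "double"
typeof(my_integer)   # "integer"
typeof(my_complex)   # "complex"
typeof(my_logical)   # "logical"
```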
R can also be used as a calculator. R follows the standard order of
priority and accepts the following operators:
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
x <- 4+2*6
y <- 2^10
#To print out the value of a variable, you can use the function print() or just type the name of the variable
print(x)
## [1] 16
## [1] 1024
R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g., lists, data frames).
A vector is the simplest and most used data structure in R. It is a
collection of data values of the same type (e.g., all numerics,
characters, logicals…). Vectors are often called atomic vectors because they can only accommodate one type of R data.
The easiest way to create a vector is to use the c() function that
concatenates several given elements.
color <- c("blue", "green", "orange", "red", "yellow") #vector of characters
size <- c(39, 43, 38, 41) #vector of numerics
phd <- c(TRUE, FALSE, FALSE) #vector of logicals
Let’s have a look at the attributes of the vectors we just created:
## [1] "character"
## [1] 4
## logi [1:3] TRUE FALSE FALSE
When combining vectors of different data types, one data type will
outrule the other one(s). This conversion between types is called
“coercion”.
When R converts the mode of storage based on its content, it is referred
to as “implicit coercion”.
R applies the following rule: Character > Numeric > Integer > Logical.
## [1] "blue" "green" "orange" "red" "yellow" "39" "43" "38"
## [9] "41"
## [1] "character"
You can also control the coercion using as.<data_type>(). This is referred to as “explicit coercion”.
## [1] "39" "43" "38" "41"
## [1] "character"
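The coercion outputs above could be reproduced with something like the following (assuming the color and size vectors defined earlier):

```r
# Implicit coercion: character "wins" over numeric
mixed <- c(color, size)
class(mixed)          # "character"

# Explicit coercion with as.<data_type>()
as.character(size)    # "39" "43" "38" "41"
as.numeric(as.character(size))  # back to numbers
```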
Manipulation of vectors: adding elements, using missing data, retrieving elements…
Adding an element
## [1] "blue" "green" "orange" "red" "yellow" "white"
## [1] 40 39 43 38 41
Manipulating missing data
Missing values are represented by NA (Not Available)
without quotes. NA represents both missing character and
numeric data. NA values have a class also, so there are integer NA,
character NA, etc… Impossible values (e.g., dividing by zero) are
represented by the symbol NaN (Not A Number).
is.na() is used to test objects if they are NA
is.nan() is used to test for NaN
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
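A minimal sketch that would produce an output like the one above (the vector is made up for illustration):

```r
x <- c(39, 43, 38, 41, 40, NA, NaN)  # hypothetical vector with missing values
is.na(x)    # TRUE for both NA and NaN
is.nan(x)   # TRUE only for the NaN element
```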
Retrieving elements
## [1] "green"
## [1] "blue" "green" "orange"
## [1] "green" "red"
## [1] "blue" "orange" "red" "yellow"
## [1] 43 41
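The retrieval outputs above could come from indexing commands like these (assuming the color and size vectors from earlier):

```r
color[2]          # a single element by position: "green"
color[1:3]        # a range of positions
color[c(2, 4)]    # a vector of positions
color[-2]         # all elements except the second
size[size > 40]   # elements matching a logical condition: 43 41
```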
Matrices are vectors with two dimensions: the number of rows and the number of columns. The dimension attribute is itself an integer vector of length 2 (nrow, ncol). Matrices can hold numeric, character or logical values. The elements in a matrix all have the same data type.
m <- matrix(nrow = 2, ncol = 3) #to create a matrix with 2 rows and 3 columns
m #to print the content of matrix m. The matrix will be filled with NAs if no values are given.
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [1] 2 3
Matrices are built column-wise, so the entries will fill all the rows of column 1 before moving to column 2.
stud1 <- c(14,15)
stud2 <- c(10,11)
stud3 <- c(19,17)
grades <- matrix(c(stud1, stud2, stud3), nrow = 2, ncol = 3)
grades
## [,1] [,2] [,3]
## [1,] 14 10 19
## [2,] 15 11 17
You can change the names of the columns and the rows using colnames() and rownames().
rownames(grades) <- c("test A", "test B")
colnames(grades) <- c("student 1", "student 2", "student 3")
grades
## student 1 student 2 student 3
## test A 14 10 19
## test B 15 11 17
You can add columns or rows to a matrix using cbind() or rbind(), respectively.
## student 1 student 2 student 3 stud4
## test A 14 10 19 13
## test B 15 11 17 15
## student 1 student 2 student 3 student 4
## test A 14 10 19 13
## test B 15 11 17 15
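A hedged reconstruction of how the fourth column above could have been added (the stud4 grades are illustrative):

```r
stud4 <- c(13, 15)                  # grades of a hypothetical fourth student
grades <- cbind(grades, stud4)      # adds a column, named "stud4" by default
colnames(grades)[4] <- "student 4"  # rename the new column
# rbind() would add a row in the same way
```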
Data often falls into a limited number of categories. For example, at
a sport competition, athletes could get gold, silver, bronze or no
medal. In R, categorical variables are stored in factors. Factors can be
unordered or ordered. Factor objects can be created with the
factor() function.
## [1] no silver bronze no no gold
## Levels: bronze gold no silver
## bronze gold no silver
## 1 1 3 1
The levels are ordered by alphabetical order by default. The order of
the levels of a factor can be set using the levels argument
to factor().
medal <- factor(c("no", "silver", "bronze", "no", "no", "gold"), levels = c("gold", "silver", "bronze", "no"))
summary(medal)
## gold silver bronze no
## 1 1 1 3
Unlike vectors, the contents of a list are not restricted to a single type and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be of any type of R object. This property makes them fundamentally different from atomic vectors.
You can create lists using list().
An empty list of the required length can be created using vector().
You can name the components of a list.
features <- list(eyecolor = c("blue", "brown", "black"), age = c(18,23,25), siblings = c(FALSE, TRUE, TRUE))
You can also name the components of a list using names().
Retrieving components of a list
To retrieve a component of a list, you can use [[]], using either the
position of the element in the list or the name of that component. You
can also use the “$” sign followed by the name of the component.
## [1] "blue" "brown" "black"
## [1] 18 23 25
## [1] FALSE TRUE TRUE
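The three outputs above could be produced like this (using the features list defined above):

```r
features[[1]]        # by position: the eyecolor vector
features[["age"]]    # by component name
features$siblings    # with the $ sign
```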
You will mostly work with data frames if you want to import and
analyze datasets in R. A data frame is a special type of list where
every component of the list has the same length. Unlike matrices, data
frames can store different classes of objects in each column. R has some
built-in data frames that you can use to practice some manipulations.
Let’s have a look at the iris dataset.
Here are some useful functions for data frames:
To show the first 6 rows: head()
To show the last 6 rows: tail()
To print the dimensions of data frame (i.e. number of rows and number of
columns): dim()
To print the number of rows: nrow()
To print the number of columns: ncol()
To print the structure of a data frame (i.e. name, type and preview of
data in each column): str()
To show the names attribute for a data frame: names() or
colnames()
To show the class of each column in the data frame:
sapply(<dataframe>, class)
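As a sketch, here is how these functions could be applied to the built-in iris data frame:

```r
head(iris)            # first 6 rows
tail(iris)            # last 6 rows
dim(iris)             # number of rows and columns
nrow(iris)            # number of rows
ncol(iris)            # number of columns
str(iris)             # structure: name, type and preview of each column
names(iris)           # column names (colnames(iris) is equivalent)
sapply(iris, class)   # class of each column
```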
Let’s have a look at our dataset:
## [1] 150 5
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Retrieving elements
To retrieve an element based on its column and row numbers
## [1] 1.4
## [1] 3.5 3.0 3.2
To select one column from the data frame
## [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3
## [19] 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2
## [37] 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5 1.5 1.3
## [55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0 1.5 1.1 1.8 1.3
## [73] 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3
## [91] 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8
## [109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8
## [127] 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3
## [145] 2.5 2.3 1.9 2.0 2.3 1.8
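The retrieval outputs above could come from commands like these:

```r
iris[1, 3]                 # element in row 1, column 3: 1.4
iris[1:3, "Sepal.Width"]   # rows 1-3 of one column: 3.5 3.0 3.2
iris$Petal.Width           # a whole column as a vector
iris[["Petal.Width"]]      # equivalent to the $ notation
```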
Filtering the rows of a data frame
iris[iris$Species %in% c("setosa", "virginica"),] #subsets the flowers (rows) that belong to the "setosa" or "virginica" species
Operations on the columns of a data frame
iris$Ratio.Sepal.Petal <- iris$Sepal.Length/iris$Petal.Length #creates a new column to store the ratio of the sepal length to the petal length
iris$Av.Sepal.Length[iris$Species=="setosa"] <- mean(iris$Sepal.Length[iris$Species=="setosa"])
iris$Av.Sepal.Length[iris$Species=="versicolor"] <- mean(iris$Sepal.Length[iris$Species=="versicolor"])
iris$Av.Sepal.Length[iris$Species=="virginica"] <- mean(iris$Sepal.Length[iris$Species=="virginica"])
You can also use a loop to perform the last 3 commands at once:
species <- unique(iris$Species)
# Use a for loop to calculate and assign the average sepal length for each species
for (s in species) {
iris$Av.Sepal.Length[iris$Species == s] <- mean(iris$Sepal.Length[iris$Species == s])
}
Export a data frame
You can save a data frame as a .tsv (tab-separated values) file or a .csv (comma-separated values) file using the write.table() function.
write.table(iris, file = "iris_R.tsv", sep = "\t", row.names = TRUE, col.names = TRUE)
write.table(iris, file = "iris_R.csv", sep = ";", row.names = TRUE, col.names = TRUE)
Import a dataset as a data frame
You can import a dataset as a data frame using the read.table() function.
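A minimal sketch of the import; the filename assumes the .tsv file written above, and you may need to adjust sep and row.names to match your file:

```r
iris_imported <- read.table("iris_R.tsv", sep = "\t", header = TRUE)
head(iris_imported)
```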
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Store the data in the variable my_data
Print the first 6 rows or view all of it.
Compute the mean value for sepal length
## [1] 5.843333
Compute the median value
## [1] 5.8
Compute the mode, using mfv() from the modeest package.
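The commands behind these statistics could look like the following (assuming my_data holds the iris data, as described above; mfv() requires the modeest package):

```r
mean(my_data$Sepal.Length)     # 5.843333
median(my_data$Sepal.Length)   # 5.8
# install.packages("modeest")  # if not yet installed
modeest::mfv(my_data$Sepal.Length)  # most frequent value(s)
```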
Range: minimum & maximum. The range spans from the smallest to the largest value, giving you the full spread of the data.
Compute the minimum value
## [1] 4.3
Compute the maximum value
## [1] 7.9
Range
## [1] 4.3 7.9
Quartiles and quantiles
Recall that quartiles divide the data into four parts. Note that the interquartile range (IQR), the difference between the first and third quartiles, is sometimes used as a robust alternative to the standard deviation. R function:
quantile(x, probs = seq(0, 1, 0.25))
x: numeric vector whose sample quantiles are wanted. probs: numeric vector of probabilities with values in [0,1].
By default, you will get quartiles:
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
Let’s compute deciles (0.1, 0.2, …, 0.9):
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 4.30 4.80 5.00 5.27 5.60 5.80 6.10 6.30 6.52 6.90 7.90
To compute the interquartile range (IQR), i.e. the "midspread" or middle 50%:
## [1] 1.3
Variance and standard deviation
The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance; it measures the average deviation of the values in the data from the mean value.
Compute the variance
## [1] 0.6856935
Compute the standard deviation
## [1] 0.8280661
Median absolute deviation
The median absolute deviation (MAD) measures the deviation of the values in the data from the median value.
Compute the median
## [1] 5.8
Compute the median absolute deviation
## [1] 1.03782
The summary() function can be used for an overall summary of a data frame or of a single variable.
Summary of a single variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
Summary of a data frame. The function is applied to each column, and the result depends on the data in that column: for a numeric column, the mean, median, min, max and quartiles are returned; for a factor column, the number of observations in each level is returned.
Let’s use it:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species Ratio.Sepal.Petal Av.Sepal.Length
## setosa :50 Min. :1.050 Min. :5.006
## versicolor:50 1st Qu.:1.230 1st Qu.:5.006
## virginica :50 Median :1.411 Median :5.936
## Mean :2.018 Mean :5.843
## 3rd Qu.:3.176 3rd Qu.:6.588
## Max. :4.833 Max. :6.588
If you do not want to see three digits after the decimal point, you can limit the precision (here to one digit):
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
## 1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
## Median :6 Median :3 Median :4 Median :1.3 virginica :50
## Mean :6 Mean :3 Mean :4 Mean :1.2
## 3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
## Max. :8 Max. :4 Max. :7 Max. :2.5
## Ratio.Sepal.Petal Av.Sepal.Length
## Min. :1 Min. :5
## 1st Qu.:1 1st Qu.:5
## Median :1 Median :6
## Mean :2 Mean :6
## 3rd Qu.:3 3rd Qu.:7
## Max. :5 Max. :7
It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it to compute the mean, sd, var, min, quantiles, etc. of each column in a data frame.
Compute the mean of each column
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
## Ratio.Sepal.Petal Av.Sepal.Length
## 2.018051 5.843333
What is "[, -5]"? It means: drop the fifth column (Species) while doing the calculation, because it is not numeric and the function cannot handle it.
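The sapply() calls behind these outputs could look like this (assuming the Ratio.Sepal.Petal and Av.Sepal.Length columns created earlier):

```r
sapply(iris[, -5], mean)       # mean of every column except the fifth (Species)
sapply(iris[, -5], quantile)   # quartiles of every numeric column
```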
Compute quartiles
## Sepal.Length Sepal.Width Petal.Length Petal.Width Ratio.Sepal.Petal
## 0% 4.3 2.0 1.00 0.1 1.050000
## 25% 5.1 2.8 1.60 0.3 1.230469
## 50% 5.8 3.0 4.35 1.3 1.410603
## 75% 6.4 3.3 5.10 1.8 3.176471
## 100% 7.9 4.4 6.90 2.5 4.833333
## Av.Sepal.Length
## 0% 5.006
## 25% 5.006
## 50% 5.936
## 75% 6.588
## 100% 6.588
We can also use apply(); however, we will need to specify whether we want to work with columns or rows. apply() takes a data frame, a matrix, or a multi-dimensional array as input and, depending on the input object type and the function passed in, outputs a vector, a list, a matrix, or an array.
The syntax of the apply() function is very simple and has only three parameters:
apply(X, MARGIN, FUN)
X: a data frame or matrix
MARGIN: 1, 2, or c(1,2), meaning the function is applied row-wise, column-wise, or both row- and column-wise, respectively
FUN: the function to apply
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.800000 3.000000 4.350000 1.300000
## Ratio.Sepal.Petal Av.Sepal.Length
## 1.410603 5.936000
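The call behind this output could be the following (MARGIN = 2 means column-wise):

```r
apply(iris[, -5], 2, median)   # median of each numeric column
```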
The %>% symbol in R is called the pipe operator, and it comes from the magrittr package, which is also imported by the dplyr package. It is used to pass the result of one expression as the first argument to the next expression. This can make your code more readable and reduces the need for nested function calls.
summarize(
filter(iris, Species == 'setosa'),
mean_sepal_length = mean(Sepal.Length),
mean_sepal_width = mean(Sepal.Width)
)
library(dplyr)
iris %>%
filter(Species == 'setosa') %>%
summarize(
mean_sepal_length = mean(Sepal.Length),
mean_sepal_width = mean(Sepal.Width)
)
We want to group the data by Species (using group_by()) and then compute, for each group: the number of elements (R function n()), the mean (mean()) and the standard deviation (sd()). Don’t forget to ignore NA values in the calculation functions mean() and sd().
Let’s use group_by, which takes our data, and groups it by species. Then, we use summarize()
Quest: group_by Species, count how many entries are present in each group (using n()), calculate mean and standard deviation.
group_by(my_data, Species) %>%
summarise(
count = n(),
mean = mean(Sepal.Length, na.rm = TRUE),
standarddeviation = sd(Sepal.Length, na.rm = TRUE)
)
We can make it a new variable, so we can manipulate it if needed…
sepal_length_table <- group_by(my_data, Species) %>%
summarise(
count = n(),
mean = mean(Sepal.Length, na.rm = TRUE),
standarddeviation = sd(Sepal.Length, na.rm = TRUE),
median = median(Petal.Length, na.rm = TRUE)
)
The result of this pipeline will be a new data frame where each row corresponds to a unique Species in the original my_data. Each row will have the count of observations, the mean of Sepal.Length, and the standard deviation of Sepal.Length for that Species. This type of summary is particularly useful for understanding the distribution of measurements within categories of a categorical variable.
Let’s try a new dataset.
len: tooth length
supp: supplement type, i.e. one of two delivery methods: OJ (orange juice) or VC (ascorbic acid, a form of vitamin C)
dose: numeric dose in milligrams/day
Can you group this dataset by dose and calculate mean, count and standard deviation for each group?
group_by(ToothGrowth, dose) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
stdev = sd(len, na.rm = TRUE),
)
Can you group this dataset by supp and calculate mean, count, standard deviation, median, min and max?
group_by(ToothGrowth, supp) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
stdev = sd(len, na.rm = TRUE),
median = median(len, na.rm =TRUE),
min = min(len),
max = max(len),
)
Graphical representations of your data can help to see structures and depict comparisons. There are many ways to visualise data and R comes with quite powerful and versatile tools to create all kinds of plots.
Base R (meaning R without any additional packages installed) already
has some functions to plot graphics from data, but these functions are
very rudimentary and are not nice to work with. Luckily, there is a
package called ggplot2, which unifies many different types
of visualisations under one umbrella, using consistent syntax. Here we
will only scratch the surface but I highly recommend that you
familiarise yourself more with it. This is a long but very good video
that explains the principles of ggplot2: https://www.youtube.com/live/h29g21z0a68?si=DdodJAcWyM0EwCI8.
We could install and load ggplot2 separately, but you
all should have tidyverse installed, which is a collection
of several packages. ggplot2 is one of them, and so are
some others that we will use in this exercise. So all we need to do is
to load tidyverse and we will have access to all these
helpful functions, including ggplot2. For more info on
tidyverse and its contained packages (including useful
cheat sheets), see: https://www.tidyverse.org/
Along the way of this exercise, there will be a few small tasks that usually only require you to change one or two parameters of the previous command. At the end of the exercise, there are several tasks to recreate the plots we made on a different data set. You will see the plot that you are supposed to create underneath the task. Look at it so you have an idea what you should do. Below the plot you will find the solution. No peeking!
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks pastecs::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks pastecs::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks pastecs::last()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
How exactly you want to visualise your data depends on the data itself and the properties you wish to highlight. The visualisations we will be looking at in this exercise are:
For this exercise we will be using the iris data set, which is included in R.
Load an R-internal data set and create a dataframe called
iris.
Look at the structure of the dataframe:
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Look at the first 6 rows of the dataframe:
View the entire dataframe in a separate tab:
Let’s see how many measurements of each species we have:
##
## setosa versicolor virginica
## 50 50 50
## [1] 5.843333
With tidyverse loaded, instead of “printing” data to the
console or assigning it to a new variable, you can also “pipe” it into a
function by using %>%. The output of that function can
then be piped into the next function and so on. This makes it easier to
keep track of your code without defining too many variables.
Tip: Build a pipe step by step to make sure that the output of one step is what it should be before you pipe it into the next function. Remember that when you are working in a script, hitting CTRL-ENTER will execute the whole pipe where your cursor is at that time, which may span over several lines. If you want to execute parts of a pipe (or of any code) first mark the text and then hit CTRL-ENTER. If you want to assign the outcome of the pipe to a variable, do that after you have built the pipe and you are happy with the outcome.
Let’s build this first pipe step by step together. Later I will only give you the complete pipe but I strongly recommend that you always build them step by step so you know what is going on and can easier trouble-shoot.
We start by calling the original data set, and confirm that it is what we think it is.
Now we pipe the dataframe into a function, in this case
group_by().
Visually not much has changed, but we get the additional info that we now have a group with 3 members in the Species column (not shown in the HTML file but you should see it in the top-left corner of output in RStudio). Following calculations will now be done separately on each group instead of the entire dataframe or column.
So let’s pipe this into the next function: summarise().
This function will create a new dataframe, with columns that we can
define. Since we are giving it a grouped dataframe,
summarise() will create a row for each member of the group,
i.e. for each species.
iris %>%
group_by(Species) %>%
summarise(mean_sepal_length = mean(Sepal.Length),
mean_sepal_width = mean(Sepal.Width),
mean_petal_length = mean(Sepal.Length),
mean_petal_width = mean(Petal.Width)
)
This is looking good. So let’s assign the output of this pipe to a new variable. We will use it further down.
Maybe the simplest type of plot is a scatter plot. Given pairs of values, it will draw them as points on a plane.
For this example, let’s filter down the iris data set to
only include the virginica species.
And now let’s create a first plot. We call ggplot() and
state which data we want to plot and aes(), the
“aesthetics”, (where we define what should be on the x and what on the y
axis). After the ggplot() call, we add “geometries” using a
+. In this case, we add a geom_point() but we
can also add more or different geometries, such as
geom_smooth() (to draw a curve that follows the points) or
geom_line() (to connect the points with lines). Play around
with adding or removing them!
ggplot(only_virginica, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_point() +
geom_smooth() +
geom_line()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
You should now see a plot appearing in the bottom-right corner of RStudio. You can resize that window or click on Zoom to open a new window that only contains this plot. You can save the plot to a file by clicking on Export.
It’s also possible to store a plot into a variable
To visualise the plot again, simply call the variable:
If you have stored the plot in a variable, you can also save it using
the function ggsave(). You will have to adjust the
filename parameter to point to a folder on your
computer.
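A sketch of storing and saving a plot; the filename is just an example and should point to a folder on your computer:

```r
p <- ggplot(only_virginica, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p  # calling the variable displays the plot again
ggsave(filename = "sepal_scatter.png", plot = p)
```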
Task: Make a scatter plot that shows the sepal length of Iris
versicolor on the x axis and its petal length on the y axis.
# Solution:
only_versicolor <- iris %>%
filter(Species=="versicolor")
ggplot(only_versicolor, aes(x=Sepal.Length, y=Petal.Length)) +
geom_point()
During the course, someone asked how to plot the coordinates next to
each point. This is how I would do it: We first need an additional
column in our dataframe. Inside a pipe we can use the
mutate() function for this. To generate the text from the
values of Sepal.Length and Sepal.Width we can
use paste(). By default, paste() will put an
empty space between the elements it combines. Let’s use a slash as a
separator instead by specifying sep="/". Remember to build
the pipe step by step so you confirm that the new column is created and
contains what you want before assigning the pipe to a new variable.
virginica_with_coords <- only_virginica %>%
mutate(coodinates_SL_SW = paste(Sepal.Length, Sepal.Width, sep="/"))
Now we can plot this new data frame. In the aes() call
we add a label parameter, assigning the
coodinates_SL_SW column to it that we just created. We also
have to add a geom_text() and change the vertical and
horizontal adjustment, so we don’t plot over the points.
ggplot(virginica_with_coords, aes(x=Sepal.Length, y=Sepal.Width, label=coodinates_SL_SW)) +
geom_point() +
geom_text(hjust=0, vjust=0)
It’s a bit messy and would need some effort to make it look good. You
could change the font size of the labels and/or the colors or install an
additional package, e.g. ggrepel, that is made for this
purpose. Making plots look pretty, not only functional, often takes the
most time, but I think you get the idea, so we will leave it like this
for now.
Earlier we created this mean_flower_sizes dataframe,
containing the mean leaf lengths. Have another look at it and then
filter it down to only include the mean sepal length.
m_sep_len <- mean_flower_sizes %>%
select(c(Species, mean_sepal_length)) # c() is necessary, because we specify more than one column.
And now make a bar chart with this filtered-down data set.
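A minimal sketch of that bar chart, using geom_col() to draw one bar per species:

```r
ggplot(m_sep_len, aes(x = Species, y = mean_sepal_length)) +
  geom_col()
```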
Some features of ggplot() work best if the data is in a
long (or so-called tidy) format, so we will have to pivot the table by
all columns except the Species column. Take a moment to understand what
pivot_longer() does. The tables before and after the
function call contain the same information, but it is presented
differently.
Now that we know how to pivot our data into a long format, we can
plot a grouped bar plot. For this we add a fill parameter
to the aes() call and position="dodge" to
geom_col(). If you leave out the latter, you would get a
stacked bar plot, which doesn’t make a lot of sense here, but try it
out! Also, you can pipe data directly into ggplot()! When
you do that, you don’t need to specify the data to plot anymore,
ggplot() will automatically use the data that is coming
through the pipe. When you are done with the plot, assign it to a
variable called p.
p <- mean_flower_sizes %>%
pivot_longer(-Species) %>%
ggplot(aes(x=name, y=value, fill=Species)) +
geom_col(position='dodge')
p # After assigning the plot to a variable you can display it again by calling the variable.
After we assigned the plot to a variable, we can still continue to
change it by calling the variable and adding things using a
+.
For example, we could add an arbitrary horizontal line using
geom_hline(). Add the parameters of this function one by
one to see what they do.
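For example (the intercept value of 5 is arbitrary, just to show the parameters):

```r
p + geom_hline(yintercept = 5, linetype = "dashed", color = "red")
```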
We can also add a title or change the axis to a log scale. Add the
following lines of code one by one to see what each of them is doing but
remember that a command should not end with a +. So type a
line without the + at the end, execute it, look what
happened, add the + and the next line without the
+ at the end, execute it and so on.
p + ggtitle("Size of Iris flowers") + # This will add a title
theme(plot.title = element_text(size=14, face="bold", hjust=0.5)) + # Make the title bigger and centered
scale_y_continuous(trans="log10") + # Change the y axis to a log scale
coord_cartesian(ylim=c(-0.4,3)) # Limit the y axis to an arbitrary range
## Warning in scale$trans$transform(coord_limits): NaNs produced
Task: Create a grouped bar chart where the groups are the species and the bars in the groups are the mean leaf lengths for that species.
So far we have visualised mean values. That’s nice but we lose the information about how the data was distributed. This is where box plots come in handy.
To start with an easy example, let’s filter down our
iris data set once more and plot a single box.
iris %>%
filter(Species=="setosa") %>%
select(Petal.Length) %>%
ggplot(aes(y=Petal.Length)) +
geom_boxplot()
This is what you need to know about boxplots:
Let’s go back to the original dataframe and make a new long table,
this time on all the measurements. We want to make a plot with 4 groups
of boxes, one for each type of measurement and in each group 3 boxes,
one for each species. In other words, we would like to group the box
plot as we have grouped the bar plot. Unsurprisingly, we code this in
the same way as well. See what happens if you leave out the
fill parameter!
# Let's save the long (or "tidy") table into a new variable, because we will need it later again:
tidy_iris <- iris %>%
pivot_longer(-Species)
ggplot(tidy_iris, aes(x=name, y=value, fill=Species)) +
geom_boxplot()
Task: Make a box plot that only contains the values from the
versicolor species.
Histograms can give you a more detailed impression of a distribution.
Histograms put the values that you give it into bins and then display
how many values are in each bin. That’s essentially all there is to it!
You can manipulate the resolution by changing the width of the bins
either directly by setting the binwidth parameter, or
implicitly by setting the number of bins with the bins
parameter. Play around with this parameter!
Let’s start with a filtered-down example.
iris %>%
filter(Species=="setosa") %>%
select(Sepal.Length) %>%
ggplot(aes(x=Sepal.Length)) + # Only x is required. y is calculated by the geometry.
geom_histogram(binwidth = 0.1)
Similarly to what we did with the box plot, we can also add several histograms simply by specifying a fill.
Note: When you add several histograms, the counts inside the
bins from the different histograms will add up! This is typically not
what we want, so we will have to tell the geometry to not add up
anything and use the values as they are by using the parameter
position="identity". But then we have the problem that some
parts of the histograms will be hidden behind others. To make them
transparent, we can choose an alpha parameter of less than
1.
iris %>%
ggplot(aes(x=Petal.Length, fill=Species)) +
geom_histogram(binwidth = 0.08, position="identity", alpha=0.8)

Density plots have a similar purpose to histograms. But instead of bars representing bins, a density plot draws a smooth curve. Let’s create one and discuss what we are seeing. To make a density plot, you can use the exact same input as for the histogram, but add a different geometry.
If you understand a histogram you will have an intuitive understanding of what a density plot is. To be more precise about what is happening, imagine all the Petal.Length values lined up on the x axis. Then a curve is drawn around each data point. The density plot is then made by adding up all the curves. Areas where many data points are close to each other will have more to add and therefore reach higher on the y axis, while areas with low density will be lower on the y axis.
Note that, unlike with histograms, the y axis in a density plot does not represent count data anymore.
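For instance, a density-only version of the Petal.Length plot could look like this (a sketch; the alpha value is a free choice):

```r
library(ggplot2)
library(dplyr)

# Same aesthetics as the histogram above, but with geom_density() instead
iris %>%
  ggplot(aes(x=Petal.Length, fill=Species)) +
  geom_density(alpha=0.5)
```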
As we have seen before, you can also add different geometries into the same plot. So if you wanted, you could plot a histogram and a density plot together.
iris %>%
ggplot(aes(x=Petal.Length, fill=Species)) +
geom_histogram(binwidth = 0.08, alpha=0.8, position="identity") +
geom_density(alpha=0.5)

Now that you know what a box plot and a density plot are, you can easily understand violin plots. They are essentially a combination of both! For a violin plot, you can use the same input as for a box plot and simply change the geometry.
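A minimal sketch, using the tidy_iris table from before (presumably what produced the plot discussed below):

```r
library(ggplot2)

# Same input as the grouped box plot, but with geom_violin()
ggplot(tidy_iris, aes(x=name, y=value, fill=Species)) +
  geom_violin()
```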
This plot is a bit unclear. The violins are very slim and it’s hard
to see what’s going on. So let me show you another feature of
ggplot2, by which you can arrange different groups of your
plots in so-called facets. Simply add facet_wrap() to your
plot and tell it by which column of your data the plot should be
divided.
The draw_quantiles parameter lets you draw horizontal
lines into the violins at different quantiles. You can choose which and
how many quantiles you want to have, here I used the
first, second and third quartile, similar to a box
plot.
The scales="free_x" allows the facets to have different
x axes, which in this case means that it prevents empty spaces. As
always, see how the plot looks like without these parameters! Also see
what happens if you use scales="free_y" or
scales="free" instead.
ggplot(tidy_iris, aes(x=name, y=value, fill=Species)) +
geom_violin(draw_quantiles=c(0.25, 0.5, 0.75)) +
facet_wrap(~name, scales="free_x")

As you can see, a violin plot resembles a box plot but now the width of the violin has a meaning. It is essentially a density plot instead of a box. In fact, it’s the same density plot drawn symmetrically to the left and right, resulting in this violin-like shape. Also, violin plots typically don’t have whiskers or separately drawn outliers.
Now you have seen most of the basic plots! I encourage you to practice what you learnt on a different data set. Below you will find a few exercises that you can do in your own time. Like with the small tasks before, I will embed the end result underneath the task so you know what you should achieve. The solutions will be below the plot but I highly recommend that you try your best to figure it out yourself! That being said, I encourage you to do internet searches if you don’t know how to do something.
The PlantGrowth data set contains the weight of some
plants after 2 different treatments and in a control group.
Explore the data! Use View(), str(),
head() and so on to get an idea of what you are dealing
with.
As you will see, the PlantGrowth dataframe is kind of in
a long format. That’s ok for almost all plots in this exercise but I
will also provide you with a wide table, which you will need for the
scatter plot:
PlantGrowth_wide <- PlantGrowth %>%
mutate(measurement = rep(1:10, 3)) %>%
pivot_wider(names_from = group, values_from = weight)

Show the weight for treatment 1 on the x axis and the weight for treatment 2 on the y axis.
Plot the median plant growth for the different treatments and the control
Plot plant growth for the different treatments and the control.
Plot the plant growth for the different treatments and the control (set the binwidth to 0.15).
Plot the plant growth for the different treatments and the control
(set color instead of fill and don’t set an
alpha):
When to use?
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
First, we will make a separate dataframe of iris for each species. This will help us keep things a bit cleaner in the rest of the course.
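The chunk that created these dataframes was not captured above; a likely version using dplyr's filter() (the variables setosa, versicolor and virginica are used throughout the rest of the course) would be:

```r
library(dplyr)

# One dataframe per species
setosa     <- iris %>% filter(Species == "setosa")
versicolor <- iris %>% filter(Species == "versicolor")
virginica  <- iris %>% filter(Species == "virginica")
```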
Let’s then create a dataframe with only 2 iris species and visualize the width of their sepals.
One of the assumptions of a t-test is that the data is normally distributed, meaning that the data should be symmetric around the mean, with values near the mean more frequent than values far from it. When visualized, a normal distribution appears as a “bell curve”.
rbind(setosa, virginica) %>%
ggplot(aes(x=Sepal.Width, fill=Species))+
geom_histogram(binwidth = 0.08, alpha=0.8, position="identity") +
geom_density(alpha=0.5)

Apart from visually checking if the data is normally distributed, a more robust method is to test the normality of the distribution with a statistical test. The Shapiro-Wilk test is suited for this; its null hypothesis is that the data is normally distributed. The for loop below tests the normality of sepal width for all species.
for (i in unique(iris$Species)){
normality_test <- shapiro.test(iris$Sepal.Width[iris$Species==i])
print(i)
print(normality_test)
}

## [1] "setosa"
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == i]
## W = 0.97172, p-value = 0.2715
##
## [1] "versicolor"
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == i]
## W = 0.97413, p-value = 0.338
##
## [1] "virginica"
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == i]
## W = 0.96739, p-value = 0.1809
A second assumption that should be fulfilled before using a t-test is that the variances of the two groups are equal. The variance tells you something about the spread in your data, i.e. how far each number in the dataset is from the mean. The F-test is used to test for equal variances. The null hypothesis of this test is that the variances of the two groups are equal.
variance_test <- rbind(setosa, virginica) %>%
var.test(Sepal.Width ~ Species, data=.)
variance_test$p.value

## [1] 0.2613893
Finally, if all assumptions are met, perform the t-test. The t-test’s null hypothesis assumes that the means of the two groups are equal. If the p-value is < 0.05, the difference is statistically significant and it is very unlikely that we see this difference by chance.
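The exact call behind the p-value below was not captured; assuming equal variances (which the F-test above supported), it was presumably along these lines:

```r
# Two-sample t-test on sepal width, setosa vs. virginica,
# assuming equal variances (var.equal = TRUE)
ttest <- t.test(setosa$Sepal.Width, virginica$Sepal.Width, var.equal = TRUE)
ttest$p.value
```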
## [1] 4.246355e-09
Let’s add this p-value to our boxplot with the
stat_compare_means function of the ggpubr
package.
rbind(setosa, virginica) %>%
ggplot(aes(x=Species, y=Sepal.Width, fill=Species))+
geom_boxplot()+
ggpubr::stat_compare_means(method = "t.test", label.y=4.5, method.args = list(var.equal=TRUE))

One sample t-test (test if the mean is significantly different from a specified value):
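The call that produced the output below is recoverable from it (data: setosa$Sepal.Width, alternative "true mean is greater than 3"):

```r
# Test whether the mean setosa sepal width is greater than 3
t.test(setosa$Sepal.Width, mu = 3, alternative = "greater")
```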
##
## One Sample t-test
##
## data: setosa$Sepal.Width
## t = 7.9839, df = 49, p-value = 1.011e-10
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
## 3.338124 Inf
## sample estimates:
## mean of x
## 3.428
When to use?
Let’s again test the normality of the Petal.Width variable for the setosa and virginica species.
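The calls themselves are not shown above; judging from the output, they were presumably:

```r
# Shapiro-Wilk normality test on petal width for each of the two species
shapiro.test(setosa$Petal.Width)
shapiro.test(virginica$Petal.Width)
```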
##
## Shapiro-Wilk normality test
##
## data: setosa$Petal.Width
## W = 0.79976, p-value = 8.659e-07
##
## Shapiro-Wilk normality test
##
## data: virginica$Petal.Width
## W = 0.95977, p-value = 0.08695
The setosa petal widths are not normally distributed, which means we cannot use the t-test. Instead, we use a non-parametric equivalent of the t-test, the Wilcoxon rank sum test.
wilcox <- wilcox.test(virginica$Petal.Width,
setosa$Petal.Width, alternative = "two.sided")
wilcox$p.value

## [1] 2.432193e-18
When to use?
We tested the normality of Sepal.Width for all species
here
and found that the sepal width of every species was normally
distributed.
A check for equal variances can be performed with the Bartlett or the Levene test.
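The call is not shown above; from the output it was presumably:

```r
# Bartlett test for homogeneity of variances across the three species
bartlett.test(Sepal.Width ~ Species, data = iris)
```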
##
## Bartlett test of homogeneity of variances
##
## data: Sepal.Width by Species
## Bartlett's K-squared = 2.0911, df = 2, p-value = 0.3515
One-way ANOVA compares the means between multiple groups and tells you if they are significantly different. ANOVA does not tell you which comparisons are different!
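The code for this ANOVA was not captured; judging from the output table, it was presumably:

```r
# One-way ANOVA of sepal width across the three species
anova_result <- aov(Sepal.Width ~ Species, data = iris)
summary(anova_result)
```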
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
iris %>%
ggplot(aes(x=Species, y=Sepal.Width, fill=Species))+
geom_boxplot()+
ggpubr::stat_compare_means(label.y=4.5, method = "aov")

When the ANOVA is statistically significant, you can proceed to test which groups are significantly different. To test which means differ across the groups, we have to perform multiple t-tests. Exercise: Perform a t-test for each comparison of the Sepal.Width measurement and store each p-value in a separate variable.
#Solution
setosa_versicolor <- t.test(setosa$Sepal.Width, versicolor$Sepal.Width)
setosa_virginica <- t.test(setosa$Sepal.Width, virginica$Sepal.Width)
versicolor_virginica <- t.test(versicolor$Sepal.Width, virginica$Sepal.Width)
pval1 <- setosa_versicolor$p.value
pval2 <- setosa_virginica$p.value
pval3 <- versicolor_virginica$p.value
# Alternative solution without the per-species dataframes:
comp1 <- t.test(iris[iris$Species == "setosa",]$Sepal.Width,
                iris[iris$Species == "virginica",]$Sepal.Width)
comp2 <- t.test(iris[iris$Species == "setosa",]$Sepal.Width,
                iris[iris$Species == "versicolor",]$Sepal.Width)
comp3 <- t.test(iris[iris$Species == "virginica",]$Sepal.Width,
                iris[iris$Species == "versicolor",]$Sepal.Width)
pval1 <- comp1$p.value
pval2 <- comp2$p.value
pval3 <- comp3$p.value
#Solution2
stat.test <- iris %>%
pivot_longer(-Species, names_to = "Measurement", values_to = "value") %>%
filter(Measurement=="Sepal.Width") %>%
rstatix::t_test(value ~ Species) %>%
rstatix::adjust_pvalue(method = "BH") %>%
rstatix::add_significance()
stat.test

NOTE: the non-parametric equivalent of the ANOVA is the Kruskal-Wallis test.
##
## Kruskal-Wallis rank sum test
##
## data: Petal.Width by Species
## Kruskal-Wallis chi-squared = 131.19, df = 2, p-value < 2.2e-16
When performing multiple statistical tests, the probability of making a type I error (falsely rejecting the null hypothesis) increases. Therefore, we need to correct the p-values for multiple testing. The simplest, but quite stringent, correction is the Bonferroni correction, which divides the significance level alpha by the number of tests. Another correction for multiple testing works through the false discovery rate (FDR): the expected proportion of differences called significant that are in reality not different. An FDR of 5% means that, among all the differences called significant, 5% are in reality not different. An example is the Benjamini-Hochberg FDR.
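The three output lines below show, in order, the unadjusted p-values, the Bonferroni-adjusted ones and the Benjamini-Hochberg (BH) adjusted ones. A sketch of how to produce them with p.adjust (assuming pval1, pval2 and pval3 from the solution above):

```r
pvals <- c(pval1, pval2, pval3)          # raw p-values from the three t-tests
pvals
p.adjust(pvals, method = "bonferroni")   # Bonferroni: multiply each p-value by the number of tests
p.adjust(pvals, method = "BH")           # Benjamini-Hochberg false discovery rate
```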
## [1] 4.570771e-09 2.484228e-15 1.819483e-03
## [1] 1.371231e-08 7.452684e-15 5.458450e-03
## [1] 6.856157e-09 7.452684e-15 1.819483e-03
The ggpubr package provides the stat_pwc
function to easily visualize the significance across all
comparisons.
iris %>%
ggplot(aes(x=Species, y=Sepal.Width, fill=Species))+
geom_boxplot()+
ggpubr::stat_pwc(aes(x=Species, y=Sepal.Width, group=Species),
method = "t.test", p.adjust.method="BH", label = "p.adj")

When to use?
To calculate the correlation between two variables, we can use the
cor function. This outputs the correlation coefficient,
which ranges from -1 to 1; values close to 1 indicate a strong positive
correlation and values close to -1 a strong negative one.
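For our example, that looks like this (call reconstructed from the output below):

```r
# Correlation between petal length and petal width across all iris flowers
cor(iris$Petal.Length, iris$Petal.Width)
```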
## [1] 0.9628654
We can fit a line through the data points, and the equation of this linear regression line is our model.
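The model fit itself is not shown above; from the Call line in the output it was:

```r
# Fit a linear model predicting petal length from petal width
model <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(model)
```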
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
Residuals
Residuals are essentially the difference between the actual observed response values (here the petal length) and the response values that the model predicted.
When assessing how well the model fits the data, you should look for a symmetrical (normal) distribution of the residuals around a mean of zero.
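The test behind the output below, using the model object fitted above, was presumably:

```r
# Shapiro-Wilk test on the residuals of the fitted model
shapiro.test(model$residuals)
```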
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.99011, p-value = 0.3753
Estimate
The intercept coefficient, in our example, is the predicted petal length if the petal width were 0 cm.
Hypothetically, this would mean that a petal that is 0 cm wide has a length of 1.1 cm. Of course, this does not make much sense in the real world, because petals with a width of 0 cm do not exist.
The second row in the Coefficients table is the slope, or in our example, the effect that the petal width has on the petal length. The slope term in our model says that for every additional cm of petal width, the petal length increases by about 2.2 cm.
Pr(>|t|)
The Pr(>|t|) column is the p-value. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor (petal width) and the response (petal length) variables by chance.
Consequently, a small p-value for the intercept and the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between the petal width and the petal length.
iris %>%
ggplot(aes(x=Petal.Width, y=Petal.Length))+
geom_point()+
geom_smooth(method = "lm")+
ggpubr::stat_cor(label.y = 6.5)+
ggpubr::stat_regline_equation(label.y=6)

## `geom_smooth()` using formula = 'y ~ x'
Exercise: Perform a linear regression analysis on the
R cars dataset. This dataset contains data on the stopping
distance of cars travelling at different speeds.
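One possible solution, matching the output below:

```r
# Fit stopping distance as a function of speed, then check
# whether the residuals are normally distributed
mod <- lm(dist ~ speed, data = cars)
summary(mod)
shapiro.test(mod$residuals)
```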
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
##
## Shapiro-Wilk normality test
##
## data: mod$residuals
## W = 0.94509, p-value = 0.02152
cars %>%
ggplot(aes(x=speed, y=dist))+
geom_point()+
geom_smooth(method="lm")+
stat_cor()+
stat_regline_equation(label.y = 100)

## `geom_smooth()` using formula = 'y ~ x'